Property Database FY 16-19

First let’s import the FY16-FY19 property assessment data, provided by the City of Cambridge.

rawPropertyData <- read_delim('Cambridge_Property_Database_FY16-FY19.csv', delim = ',', guess_max = 10000)
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   PID = col_double(),
##   BldgNum = col_double(),
##   StateClassCode = col_double(),
##   LandArea = col_double(),
##   YearOfAssessment = col_double(),
##   BuildingValue = col_double(),
##   LandValue = col_double(),
##   AssessedValue = col_double(),
##   SalePrice = col_double(),
##   PreviousAssessedValue = col_double(),
##   Exterior_NumStories = col_double(),
##   Exterior_WallHeight = col_double(),
##   Exterior_FloorLocation = col_double(),
##   Interior_LivingArea = col_double(),
##   Interior_NumUnits = col_double(),
##   Interior_TotalRooms = col_double(),
##   Interior_Bedrooms = col_double(),
##   Interior_FullBaths = col_double(),
##   Interior_HalfBaths = col_double(),
##   Interior_Fireplaces = col_double()
##   # ... with 7 more columns
## )
## See spec(...) for full column specifications.
# importProblems <- problems(rawPropertyData <- read_delim('Cambridge_Property_Database_FY16-FY19.csv', delim = ','))

It’s having some difficulty importing the Property Tax Amount as the first few thousand records are all integers, and so it infers an integer type. I’ve tried playing around with the guess_max option of read_delim to no avail. For now it’s not really a big deal.

Summary of Raw Data

Let’s take a look at what the data looks like.

summary(rawPropertyData)
##       PID            GISID              BldgNum         Address         
##  Min.   :     3   Length:115838      Min.   : 1.000   Length:115838     
##  1st Qu.:  7118   Class :character   1st Qu.: 1.000   Class :character  
##  Median : 14206   Mode  :character   Median : 1.000   Mode  :character  
##  Mean   : 45961                      Mean   : 1.157                     
##  3rd Qu.: 21316                      3rd Qu.: 1.000                     
##  Max.   :194531                      Max.   :38.000                     
##                                                                         
##      Unit           StateClassCode   PropertyClass         Zoning         
##  Length:115838      Min.   :  13.0   Length:115838      Length:115838     
##  Class :character   1st Qu.: 102.0   Class :character   Class :character  
##  Mode  :character   Median : 102.0   Mode  :character   Mode  :character  
##                     Mean   : 310.1                                        
##                     3rd Qu.: 199.0                                        
##                     Max.   :9971.0                                        
##                                                                           
##    Map/Lot             LandArea       YearOfAssessment  TaxDistrict       
##  Length:115838      Min.   :      0   Min.   :      0   Length:115838     
##  Class :character   1st Qu.:      0   1st Qu.:   2017   Class :character  
##  Mode  :character   Median :      0   Median :   2018   Mode  :character  
##                     Mean   :  11703   Mean   :   5990                     
##                     3rd Qu.:   4300   3rd Qu.:   2019                     
##                     Max.   :7251433   Max.   :2175997                     
##                     NA's   :1056                                          
##  ResidentialExemption BuildingValue         LandValue        
##  Length:115838        Min.   : -1419000   Min.   :    -7600  
##  Class :character     1st Qu.:   329800   1st Qu.:        0  
##  Mode  :character     Median :   526200   Median :        0  
##                       Mean   :  3196949   Mean   :  3704024  
##                       3rd Qu.:   755300   3rd Qu.:   560000  
##                       Max.   :982713000   Max.   :929332000  
##                       NA's   :1056                           
##  AssessedValue         SalePrice          Book/Page        
##  Min.   :0.000e+00   Min.   :0.000e+00   Length:115838     
##  1st Qu.:4.484e+05   1st Qu.:1.000e+00   Class :character  
##  Median :6.868e+05   Median :2.150e+05   Mode  :character  
##  Mean   :6.080e+06   Mean   :3.732e+06                     
##  3rd Qu.:1.159e+06   3rd Qu.:5.200e+05                     
##  Max.   :1.380e+09   Max.   :1.229e+09                     
##                                                            
##    SaleDate         PreviousAssessedValue  Owner_Name       
##  Length:115838      Min.   :0.000e+00     Length:115838     
##  Class :character   1st Qu.:3.956e+05     Class :character  
##  Mode  :character   Median :6.156e+05     Mode  :character  
##                     Mean   :4.698e+06                       
##                     3rd Qu.:1.031e+06                       
##                     Max.   :1.295e+09                       
##                     NA's   :528                             
##  Owner_CoOwnerName  Owner_Location     Owner_Address     
##  Length:115838      Length:115838      Length:115838     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  Owner_Address2      Owner_City        Owner_State       
##  Length:115838      Length:115838      Length:115838     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   Owner_Zip         Exterior_Style     Exterior_occupancy
##  Length:115838      Length:115838      Length:115838     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  Exterior_NumStories Exterior_WallType  Exterior_WallHeight
##  Min.   : 0.000      Length:115838      Min.   : 1.00      
##  1st Qu.: 1.000      Class :character   1st Qu.:10.00      
##  Median : 1.000      Mode  :character   Median :12.00      
##  Mean   : 1.615                         Mean   :11.24      
##  3rd Qu.: 2.250                         3rd Qu.:12.00      
##  Max.   :26.000                         Max.   :80.00      
##  NA's   :1979                           NA's   :104620     
##  Exterior_RoofType  Exterior_RoofMaterial Exterior_FloorLocation
##  Length:115838      Length:115838         Min.   : 1.00         
##  Class :character   Class :character      1st Qu.: 1.00         
##  Mode  :character   Mode  :character      Median : 2.00         
##                                           Mean   : 2.79         
##                                           3rd Qu.: 3.00         
##                                           Max.   :25.00         
##                                           NA's   :58906         
##  Exterior_View      Interior_LivingArea Interior_NumUnits
##  Length:115838      Min.   :     0      Min.   :     0   
##  Class :character   1st Qu.:   735      1st Qu.:     0   
##  Mode  :character   Median :  1170      Median :     1   
##                     Mean   :  3626      Mean   :   850   
##                     3rd Qu.:  2239      3rd Qu.:     2   
##                     Max.   :956664      Max.   :374995   
##                     NA's   :1979        NA's   :59929    
##  Interior_TotalRooms Interior_Bedrooms Interior_Kitchens 
##  Min.   :     0      Min.   :  0.000   Length:115838     
##  1st Qu.:     3      1st Qu.:  1.000   Class :character  
##  Median :     5      Median :  2.000   Mode  :character  
##  Mean   :   435      Mean   :  2.552                     
##  3rd Qu.:     7      3rd Qu.:  3.000                     
##  Max.   :374995      Max.   :495.000                     
##  NA's   :10359       NA's   :11252                       
##  Interior_FullBaths Interior_HalfBaths Interior_Fireplaces
##  Min.   :  0.000    Min.   : 0.000     Min.   : 0.000     
##  1st Qu.:  1.000    1st Qu.: 0.000     1st Qu.: 0.000     
##  Median :  1.000    Median : 0.000     Median : 0.000     
##  Mean   :  1.549    Mean   : 0.242     Mean   : 0.267     
##  3rd Qu.:  2.000    3rd Qu.: 0.000     3rd Qu.: 0.000     
##  Max.   :321.000    Max.   :40.000     Max.   :24.000     
##  NA's   :12191      NA's   :12151      NA's   :12151      
##  Interior_Flooring  Interior_Layout    Interior_LaundryInUnit
##  Length:115838      Length:115838      Length:115838         
##  Class :character   Class :character   Class :character      
##  Mode  :character   Mode  :character   Mode  :character      
##                                                              
##                                                              
##                                                              
##                                                              
##  Systems_HeatType   Systems_HeatFuel   Systems_CentralAir
##  Length:115838      Length:115838      Length:115838     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  Systems_Plumbing   Condition_YearBuilt Condition_InteriorCondition
##  Length:115838      Min.   :   0        Length:115838              
##  Class :character   1st Qu.:1884        Class :character           
##  Mode  :character   Median :1908        Mode  :character           
##                     Mean   :1697                                   
##                     3rd Qu.:1950                                   
##                     Max.   :2018                                   
##                     NA's   :1979                                   
##  Condition_OverallCondition Condition_OverallGrade  Parking_Open   
##  Length:115838              Length:115838          Min.   : 0.000  
##  Class :character           Class :character       1st Qu.: 0.000  
##  Mode  :character           Mode  :character       Median : 0.000  
##                                                    Mean   : 0.619  
##                                                    3rd Qu.: 1.000  
##                                                    Max.   :80.000  
##                                                    NA's   :12298   
##  Parking_Covered  Parking_Garage  UnfinishedBasementGross
##  Min.   : 0.000   Min.   : 0.00   Min.   : -904.0        
##  1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.:    0.0        
##  Median : 0.000   Median : 0.00   Median :    0.0        
##  Mean   : 0.133   Mean   : 0.35   Mean   :  522.7        
##  3rd Qu.: 0.000   3rd Qu.: 1.00   3rd Qu.:  883.0        
##  Max.   :45.000   Max.   :99.00   Max.   :77288.0        
##  NA's   :12151    NA's   :56661   NA's   :1939           
##  FinishedBasementGross PropertyTaxAmount
##  Min.   : -950.0       Min.   :     -1  
##  1st Qu.:    0.0       1st Qu.:   2986  
##  Median :    0.0       Median :   4573  
##  Mean   :  122.8       Mean   :  19090  
##  3rd Qu.:    0.0       3rd Qu.:   7725  
##  Max.   :50516.0       Max.   :6406930  
##  NA's   :11218         NA's   :86703

Obviously there are a lot of variables here. I’ll explore many of these in more detail later, but some obvious things jump out at a glance. First, there are a lot of string values which should probably be coded as factors. This will make analysis a little bit easier. Second, there are definitely some garbage values in here. Property Tax Amount jumps out at the end here with more than half of properties showing a value of -1. And there are plenty of other weird values as well, such as basement sizes in the negatives, a Year Built of zero, one building with 374,995 units? That just seems implausible.

Raw Data Scrutiny

YearOfAssessment

What do the values for YearOfAssessment look like?

ggplot(rawPropertyData, aes(x = YearOfAssessment)) + geom_histogram(bins = 50)

That doesn’t look right. What do values look like outside of a sane range?

rawPropertyData %>% filter(YearOfAssessment > 2019 | YearOfAssessment <= 1900) %>% ggplot(aes(x = YearOfAssessment)) + geom_histogram(bins = 50)

And what about in a sane range?

rawPropertyData %>% filter(YearOfAssessment > 1900 & YearOfAssessment <= 2019) %>% ggplot(aes(x = YearOfAssessment)) + geom_histogram(binwidth = 1)

Okay, so clearly there are a good amount of records with garbage values. How many exactly?

rawPropertyData %>% transmute(SaneYOA = YearOfAssessment > 1990 & YearOfAssessment <= 2019) %>% count(SaneYOA)

In the scheme of things, not so many. We can decide later if it makes sense to drop these rows or just replace with some sort of NA value.

Now, let’s take a look at the date of sale.

rawPropertyData %>% select(SaleDate) %>% head(20)

Okay, there is obviously something strange happening here. Several have the date 01/05/0001. To me that looks like a code for unknown date. Others are dated 01/01/xxxxx where xxxxx is some very large number, usually in the 30,000 to 40,000 range. I’m guessing that’s the number of days since 1900.

rawPropertyData %>% select(SaleDate) %>% mutate(chars = str_length(SaleDate)) %>% filter(chars == 10) %>% count(SaleDate) %>% arrange(desc(n))

It appears another code for unknown date is 01/01/1900. I’m not quite sure what’s going on with 01/01/1998, that seems a little suspicious. Are there any sales before 1900?

rawPropertyData %>% mutate(SaleYear = as.integer(stri_extract_last_regex(SaleDate, "[0-9]{4,5}"))) %>% filter(SaleYear > 1 & SaleYear < 1900) %>% select(Owner_Name, SaleDate, Address)

Just some City of Cambridge properties, and the dates look legitimate. So, we can take any large years and convert them to dates and perhaps set dates with years less than 1800 to NA.

Data Cleanup

Strings to Factors

Let’s start small and convert those string values into factors.

factorCols <- c(
  "PID",
  "StateClassCode",
  "PropertyClass",
  "Zoning",
  "TaxDistrict",
  "ResidentialExemption",
  "Owner_State",
  "Exterior_Style",
  "Exterior_occupancy",
  "Exterior_WallType",
  "Exterior_RoofType",
  "Exterior_RoofMaterial",
  "Interior_Flooring",
  "Interior_Layout",
  "Interior_LaundryInUnit",
  "Systems_HeatType",
  "Systems_HeatFuel",
  "Systems_CentralAir",
  "Systems_Plumbing",
  "Condition_InteriorCondition",
  "Condition_OverallCondition",
  "Condition_OverallGrade"
)

propertyData <- rawPropertyData %>% mutate_at(factorCols, as.factor)

Now let’s get a shorter representation of the address and GPS coordinates from the Address field.

# propertyData %>% select(Address) %>% mutate(loc = regmatches(Address, regexpr("\\(.*\\)", Address)))

propertyData <- propertyData %>% mutate(
  GPSLoc = stri_extract_last_regex(Address, "\\(.*\\)"),
  ShortAddress = str_remove(Address, "\\n.*\\n.*"))

Let’s correct those bogus-looking dates.

# create a date object at Jan 1 1900 as reference for corrected dates
baseDate <- as_date("01/01/1900", tz = "UTC", format = "%m/%d/%Y")

propertyData <- propertyData %>%
  mutate(SaleYear = as.integer(stri_extract_last_regex(SaleDate, "[0-9]{4,5}"))) %>% 
  mutate(SaleDate = na_if(SaleDate, "01/05/0001")) %>% 
  mutate(SaleDate = na_if(SaleDate, "01/01/1900")) %>%
  mutate(SaleDate = if_else(
    SaleYear > 2100, 
    baseDate + as.period(SaleYear, unit = "days"), 
    as_date(SaleDate, tz = "UTC", format = "%m/%d/%Y")))

propertyData %>% count(SaleDate) %>% arrange(desc(n))

Data Exploration

Property ID

Which property ids appear the most?

propertyData %>% count(PID) %>% arrange(desc(n)) %>% head(n = 15) %>% 
  ggplot(aes(x = reorder(PID, n), y = n)) + geom_bar(stat = "identity") + 
  xlab("Property ID") + ylab("Number of Assessments")

What do the assessed properties look like from the three most common property ids?

propertyData %>% filter(PID == 11716)
propertyData %>% filter(PID == 21897)
propertyData %>% filter(PID == 12630)

Found Harvard and MIT!

Assessed Value

What is the overall distribution of AssessedValue?

ggplot(propertyData, aes(x = AssessedValue)) + geom_histogram(bins = 50) + scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 10719 rows containing non-finite values (stat_bin).

What are the most valuable properties (by assessment) in Cambridge?

propertyData %>% arrange(desc(AssessedValue)) %>% select(PID, Owner_Name, ShortAddress, AssessedValue, Exterior_Style) %>% head(20)

Okay, that’s a boring answer. Of course the university properties are worth a billion dollars. What about excepting universities and colleges?

propertyData %>% arrange(desc(AssessedValue)) %>% filter(PropertyClass != "Private College, University" & PropertyClass != "Private College") %>% select(PID, Owner_Name, ShortAddress, AssessedValue, PropertyClass, Exterior_Style) %>% head(100)# %>% kable()

545 Technology Squre on top with $453M and $395M. Novartis also sitting on some valuable real estate. Some interesting properties lower in the data, such as the vacant utility authorities at #23 and #29, worth $305M and $274M. Residential property starts pretty low down the list at #51 and $226M, one of the Twenty20 buildings.

Property Class

What are the most common types of properties?

propertyData %>% count(PropertyClass) %>% arrange(desc(n))
propertyData %>% count(PropertyClass) %>% arrange(desc(n)) %>% head(n = 10) %>% ggplot(aes(x = reorder(PropertyClass, n), y = n)) + geom_bar(stat = "identity") + xlab("Property Class") + ylab("Number of Assessments") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Diving Into Single Family Residential Data

Okay, let’s just take a look at single family residential properties.

# let's look at just single unit residential properties
singleResUnitData <- propertyData %>% 
  filter(
    PropertyClass == "CONDOMINIUM" | PropertyClass == "SNGL-FAM-RES" | PropertyClass == "CNDO LUX"
    )

Highest Valued Single Family Properties

singleResUnitData %>% count(Interior_NumUnits)

What are the most valuable single family properties in the city?

singleResUnitData %>% arrange(desc(AssessedValue)) %>% filter(YearOfAssessment == 2018) %>% select(PID, Owner_Name, ShortAddress, AssessedValue, PropertyClass, Exterior_Style) %>% head(100)

The addresses sound about right.

Number of Assessments

How many assessments were performed each year?

singleResUnitData %>% count(YearOfAssessment)

How many assessments are available for each property?

singleResUnitData %>% count(PID) %>% ggplot(aes(x = n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The vast majority have four assessments. Let’s create a tbl of all of the properties with four assessments.

singleResFourAssessmentPIDs <- singleResUnitData %>% 
  count(PID) %>% filter(n == 4) %>% select(PID)
singleResFourAssessments <- inner_join(singleResUnitData, singleResFourAssessmentPIDs)
## Joining, by = "PID"

Year Over Year Assessed Value

singleResYoYAssessedValue <- singleResFourAssessments %>% 
  select(PID, AssessedValue, YearOfAssessment) %>% 
  spread(YearOfAssessment, AssessedValue)

# singleResFourAssessments %>% ggplot(aes(x = YearOfAssessment, y = AssessedValue, group = PID)) + geom_line() + scale_y_log10()
singleResFourAssessments %>% ggplot(aes(x = YearOfAssessment, group = YearOfAssessment, y = AssessedValue)) + geom_boxplot() + scale_y_log10()

singleResFourAssessments %>% group_by(YearOfAssessment) %>% summarize(mean(AssessedValue))
singleResFourAssessments %>% select(PID, SalePrice, YearOfAssessment) %>% spread(YearOfAssessment, SalePrice)
singleResFourAssessments %>% group_by(PID) %>% summarize(std = sd(SalePrice) + 1e-5) %>% ggplot(aes(x = std)) + geom_histogram(bins = 50) + scale_x_log10()

# looking at only maximum assessed value for each property, how much does
# the number of bedrooms affect assessed value?
singleResUnitData %>% group_by(PID) %>% summarize_if(is.numeric, max) %>% 
  filter(AssessedValue > 0) %>% filter(Interior_Bedrooms <= 3) %>%
  ggplot(aes(x = AssessedValue)) + geom_histogram(bins = 50) + scale_x_log10() + 
  facet_grid(Interior_Bedrooms ~ ., scales = "free_y")

# looking at only maximum assessed value for each property, how much does
# the number of bedrooms affect assessed value?
singleResUnitData %>% group_by(PID) %>% summarize_if(is.numeric, max) %>% 
  filter(AssessedValue > 0) %>% filter(Interior_Bedrooms <= 3) %>%
  ggplot(aes(x = AssessedValue)) + geom_histogram(bins = 50) +
  scale_x_log10()

# what do single unit residential properties with a rational sale price
# sell for compared to assessed value?
singleResUnitData %>% filter(SalePrice > 100) %>%
  ggplot(aes(x = AssessedValue, y = SalePrice, col = Interior_Bedrooms)) + 
  geom_point() + scale_x_log10() + scale_y_log10()

singleResUnitData %>% group_by(PID) %>% summarize_if(is.numeric, max) %>% 
  filter(AssessedValue > 0) %>% mutate(Has_Fireplace = Interior_Fireplaces > 0) %>%
  ggplot(aes(x = AssessedValue)) + geom_histogram(bins = 50) + scale_x_log10() +
  facet_grid(Has_Fireplace ~ ., scales = "free_y")

Time From Last Sale

singleResUnitData %>% filter(YearOfAssessment == 2018) %>% 
  mutate(YearsSinceSold = 2018 - year(SaleDate)) %>% filter(YearsSinceSold > 0) %>%
  ggplot(aes(x = YearsSinceSold)) + geom_histogram(binwidth = 1)

Predicting Sale Price

Let’s initially look at properties listed as sold in 2014. I’ll use the 2016 assessments as they appear to be the most complete. I’ll also filter out extremely low valued sales although it may be interesting to see if we can classify when that type of sale will happen.

singleResUnitData %>% filter(YearOfAssessment == 2016) %>% filter(year(SaleDate) == 2014)
singleResUnitData %>% filter(YearOfAssessment == 2016) %>% filter(year(SaleDate) == 2014) %>% select_if(is.numeric) %>% summarise_all(sd) %>% t()
##                                 [,1]
## BldgNum                 0.000000e+00
## LandArea                2.268142e+03
## YearOfAssessment        0.000000e+00
## BuildingValue           3.534488e+05
## LandValue               2.931463e+05
## AssessedValue           5.008223e+05
## SalePrice               5.623093e+05
## PreviousAssessedValue   4.572806e+05
## Exterior_NumStories     7.783219e-01
## Exterior_WallHeight               NA
## Exterior_FloorLocation            NA
## Interior_LivingArea     6.973143e+02
## Interior_NumUnits                 NA
## Interior_TotalRooms     1.923948e+00
## Interior_Bedrooms       1.024363e+00
## Interior_FullBaths      7.292727e-01
## Interior_HalfBaths      4.814183e-01
## Interior_Fireplaces     6.723882e-01
## Condition_YearBuilt     1.257303e+02
## Parking_Open            7.701008e-01
## Parking_Covered         3.325217e-01
## Parking_Garage                    NA
## UnfinishedBasementGross 3.200294e+02
## FinishedBasementGross   2.376105e+02
## PropertyTaxAmount                 NA
## SaleYear                9.324429e+01

Some of these standard deviations are coming up NA or zero so we’ll have to cull them (for now at least) to get rational results.

singleResSaleTrain <- singleResUnitData %>% 
  filter(YearOfAssessment == 2016) %>% filter(year(SaleDate) == 2014) %>% 
  filter(SalePrice > 1000) %>% 
  mutate(Interior_TotalRooms = recode(Interior_TotalRooms, `0` = 1),
         InUnitWD = Interior_LaundryInUnit == 1) %>%
  select(-c(BldgNum, YearOfAssessment, Exterior_WallHeight, Exterior_FloorLocation,
            Interior_NumUnits, PropertyTaxAmount, Parking_Garage, SaleYear))
singleResSaleTrain %>% select_if(is.numeric) %>% cor() %>% .[,"SalePrice"] %>% sort(decreasing = TRUE)
##               SalePrice           AssessedValue   PreviousAssessedValue 
##              1.00000000              0.95730203              0.84151737 
##     Interior_LivingArea           BuildingValue               LandValue 
##              0.83037669              0.80751596              0.69812040 
##                LandArea     Interior_TotalRooms      Interior_FullBaths 
##              0.64863353              0.64222533              0.61179475 
##     Interior_Fireplaces       Interior_Bedrooms UnfinishedBasementGross 
##              0.60409800              0.58374500              0.54012092 
##      Interior_HalfBaths            Parking_Open     Exterior_NumStories 
##              0.45837096              0.43104066              0.40662074 
##         Parking_Covered   FinishedBasementGross     Condition_YearBuilt 
##              0.38509781              0.35216095             -0.03327682

As expected, AssessedValue and PreviousAssessedValue have very high correlation to the sale price. Interior_LivingArea is very highly correlated as well which shouldn’t be too surprising.

for (factorCol in factorCols) {
  if (factorCol != "PID") {
    print(factorCol)
    singleResSaleTrain %>% select(PID, SalePrice, factorCol) %>% mutate(one = 1) %>% spread(factorCol, one, fill = 0) %>% select(-PID) %>% cor() %>% .[,"SalePrice"] %>% sort(decreasing = TRUE) %>% print()
  }
}
## [1] "StateClassCode"
##   SalePrice         101        1021         102 
##  1.00000000  0.45386809  0.03706046 -0.38583711 
## [1] "PropertyClass"
##    SalePrice SNGL-FAM-RES     CNDO LUX  CONDOMINIUM 
##   1.00000000   0.45386809   0.03706046  -0.38583711 
## [1] "Zoning"
##    SalePrice          A-2          A-1            B          C-1 
##  1.000000000  0.493464548  0.228009911  0.177753670  0.097570037 
##          C-3          C-2            C         SD-2          O-1 
##  0.065494172  0.013429479  0.007343891 -0.006162950 -0.017796302 
##           BA         <NA> 
## -0.026337659 -0.397953184 
## [1] "TaxDistrict"
##    SalePrice          R10           R6          R11          R14 
##  1.000000000  0.438026152  0.167784252  0.166326699  0.118820706 
##           R9          R15          R16           R8           R4 
##  0.080227125  0.050117107  0.020045536  0.009295112  0.004945910 
##           R5          R17           R3          R1A          R13 
##  0.003370830 -0.026501863 -0.035182866 -0.040098555 -0.041477920 
##          R12           R1           R2           R7 
## -0.072763471 -0.081459706 -0.093449420 -0.150073844 
## [1] "ResidentialExemption"
##    SalePrice            1            0 
##  1.000000000  0.006350188 -0.006350188 
## [1] "Owner_State"
##    SalePrice           CO           ME           NY           MI 
##  1.000000000  0.352902598  0.127051362  0.021636078  0.008468584 
##           PA           --           GA           VA           CT 
##  0.001218041 -0.004500393 -0.006778033 -0.013543941 -0.016250305 
##           NH           NJ           TN           TX            - 
## -0.019147223 -0.019756276 -0.020002309 -0.021557880 -0.021847556 
##           FL           MD           VT           CA           MA 
## -0.023139230 -0.025599649 -0.034395282 -0.035554834 -0.054221470 
## [1] "Exterior_Style"
##        SalePrice         COLONIAL        VICTORIAN     CONVENTIONAL 
##      1.000000000      0.427072996      0.334634721      0.138025531 
##     CONTEMPORARY          ROW-END           SINGLE     SEMIDETACHED 
##      0.121263652      0.108869637      0.097298628      0.066579635 
##       TWO-FLOORS        TOWNHOUSE            OTHER         CAPE-COD 
##      0.043814846      0.042463923      0.041607030      0.036941118 
##              ROW            RANCH    FREE-STANDING         GARRISON 
##      0.032012397      0.031071628      0.021372631     -0.004932785 
##      TOWNHSE-END            ATTIC Commercial Condo         BASEMENT 
##     -0.015630126     -0.018471762     -0.023323755     -0.083686929 
##             FLAT        APARTMENT 
##     -0.144854633     -0.268979306 
## [1] "Exterior_occupancy"
##    SalePrice SNGL-FAM-RES         <NA> 
##    1.0000000    0.4538681   -0.4538681 
## [1] "Exterior_WallType"
##      SalePrice  Frame-Clapbrd          Brick   Wood Shingle Concrete Block 
##    1.000000000    0.344484651    0.279351777    0.150615224    0.091143116 
## Aluminum-Vinyl Asbstos Shingl          Other         Stucco           <NA> 
##    0.057735497    0.049015333    0.035246926   -0.008008198   -0.453868089 
## [1] "Exterior_RoofType"
##   SalePrice         Hip     Gambrel       Gable     Mansard        Flat 
##  1.00000000  0.40407433  0.24440946  0.22299574  0.16314565  0.08891257 
##        Shed        <NA> 
##  0.01110098 -0.45386809 
## [1] "Exterior_RoofMaterial"
##      SalePrice     Slate Clay Aspahlt Shingl          Other Rubber Membrai 
##     1.00000000     0.45948037     0.24998849     0.09515368     0.06512724 
##   Roll Roofing           <NA> 
##     0.03107163    -0.45386809 
## [1] "Interior_Flooring"
##      SalePrice           <NA> Cermc/Qry Tile       Softwood Vinyl/Linoleum 
##     1.00000000     0.43152723    -0.02154002    -0.02482636    -0.03187432 
##         Carpet       Hardwood 
##    -0.07443793    -0.34601312 
## [1] "Interior_Layout"
##   SalePrice        <NA>    Superior Corner Unit        Poor   Thru Unit 
##  1.00000000  0.42986596  0.04181349  0.03178377 -0.01940709 -0.14152009 
##   No Impact 
## -0.19519846 
## [1] "Interior_LaundryInUnit"
##  SalePrice       <NA>          1          0 
##  1.0000000  0.4538681 -0.0696832 -0.2861683 
## [1] "Systems_HeatType"
##     SalePrice         Other    Forced Air  Elec-Radiant    Space Heat 
##  1.0000000000  0.2230587252  0.1808004957 -0.0003196656 -0.0115141688 
##       H.V.A.C   HW Radiator     Elec Wall          <NA>     Hot Water 
## -0.0173405528 -0.0186111036 -0.0216308218 -0.0685689647 -0.0698772827 
##     Heat Pump         Steam 
## -0.0928303465 -0.1067346127 
## [1] "Systems_HeatFuel"
##  SalePrice        Gas        Oil   Electric       <NA> 
##  1.0000000  0.4171979  0.1473439  0.0506797 -0.4538681 
## [1] "Systems_CentralAir"
##  SalePrice          1          0       <NA> 
##  1.0000000  0.4504958  0.1746491 -0.4538681 
## [1] "Systems_Plumbing"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Condition_InteriorCondition"
##   SalePrice   Excellent        Good        Fair     Average   Very Good 
##  1.00000000  0.43744545  0.14377645  0.11667631  0.07758647  0.07658401 
##        <NA> 
## -0.45386809 
## [1] "Condition_OverallCondition"
##   SalePrice    Superior   Excellent        Fair        Poor     Average 
##  1.00000000  0.36362289  0.11760053  0.05898255 -0.01420357 -0.04973421 
##   Very Good        Good 
## -0.07650985 -0.10228084 
## [1] "Condition_OverallGrade"
##      SalePrice Estate Quality Good Very Good   Very Good Ex      Excellent 
##    1.000000000    0.353967910    0.322367077    0.274903062    0.267762467 
##      Very Good   Average Plus           Poor           Fair           Good 
##    0.182489517   -0.003498485   -0.013072140   -0.038241746   -0.132633012 
##        Average 
##   -0.245905467

There are quite a few interesting results here! One that jumps out immediately is that the conditions and grades don’t necessarily go in order. An interior condition of “Very Good” is less correlated with a high sale price than a condition of “Good”. It’s pretty obvious there are some correlations between that feature and other features which are skewing the correlations. It’s also unclear how many properties actually were given each ranking. In fact, let’s count up the instance of each level of the factor features.

# singleResSaleTrain %>% select(-PID) %>% select_if(is.factor) %>% sum()
for (factorCol in factorCols) {
  if (factorCol != "PID") {
    print(factorCol)
    singleResSaleTrain %>% select(factorCol) %>% count()
    # singleResSaleTrain %>% select(PID, SalePrice, factorCol) %>% mutate(one = 1) %>% spread(factorCol, one, fill = 0) %>% select(-PID) %>% cor() %>% .[,"SalePrice"] %>% sort(decreasing = TRUE) %>% print()
  }
}
## [1] "StateClassCode"
## [1] "PropertyClass"
## [1] "Zoning"
## [1] "TaxDistrict"
## [1] "ResidentialExemption"
## [1] "Owner_State"
## [1] "Exterior_Style"
## [1] "Exterior_occupancy"
## [1] "Exterior_WallType"
## [1] "Exterior_RoofType"
## [1] "Exterior_RoofMaterial"
## [1] "Interior_Flooring"
## [1] "Interior_Layout"
## [1] "Interior_LaundryInUnit"
## [1] "Systems_HeatType"
## [1] "Systems_HeatFuel"
## [1] "Systems_CentralAir"
## [1] "Systems_Plumbing"
## [1] "Condition_InteriorCondition"
## [1] "Condition_OverallCondition"
## [1] "Condition_OverallGrade"
singleResSaleTrain %>% group_by(Exterior_Style) %>% tally() %>% arrange(desc(n))

I think there may be some correlations between building materials and style. Let’s see how serious they are.

singleResSaleTrain %>% group_by(Exterior_Style, Exterior_WallType) %>% tally() %>% arrange(desc(n))
## Warning: Factor `Exterior_WallType` contains implicit NA, consider using
## `forcats::fct_explicit_na`

Actually, it appears the vast majority of sales are of apartments, flats, townhouses, etc which do not have exterior walls or they are not recorded.

singleResSaleTrain %>% group_by(Exterior_Style, Exterior_occupancy) %>% tally() %>% arrange(Exterior_Style, desc(n))
## Warning: Factor `Exterior_occupancy` contains implicit NA, consider using
## `forcats::fct_explicit_na`
singleResSaleTrain %>% group_by(Exterior_occupancy, Exterior_RoofType) %>% tally() %>% arrange(Exterior_RoofType)
## Warning: Factor `Exterior_occupancy` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `Exterior_RoofType` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `Exterior_occupancy` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## Warning: Factor `Exterior_occupancy` contains implicit NA, consider using
## `forcats::fct_explicit_na`
singleResSaleTrain %>% filter(as.character(Exterior_occupancy) != "SNGL-FAM-RES" | is.na(Exterior_occupancy))# %>% summarize_all(funs(sum(is.na(.)))) %>% t()
for (factorCol in factorCols) {
  if (factorCol != "PID") {
    print(factorCol)
    singleResSaleTrain %>% 
      filter(as.character(Exterior_occupancy) == "SNGL-FAM-RES") %>% 
      select(PID, SalePrice, factorCol) %>% mutate(one = 1) %>% 
      spread(factorCol, one, fill = 0) %>% select(-PID) %>% 
      cor() %>% .[,"SalePrice"] %>% sort(decreasing = TRUE) %>% 
      print()
  }
}
## [1] "StateClassCode"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "PropertyClass"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Zoning"
##   SalePrice         A-2         A-1         C-3        <NA>         C-2 
##  1.00000000  0.55615298  0.27505796  0.04458499  0.03129629 -0.06069310 
##        SD-2         O-1           C           B         C-1 
## -0.06716139 -0.07003898 -0.13273946 -0.18947486 -0.19222867 
## [1] "TaxDistrict"
##   SalePrice         R10          R6         R14          R3         R16 
##  1.00000000  0.53026907  0.32391108  0.17204876  0.04114921  0.03892573 
##          R9         R15          R4         R11          R8         R12 
##  0.02725509  0.02060508  0.01573571 -0.04170017 -0.07490416 -0.10121535 
##         R17          R2         R13          R1          R7 
## -0.10419031 -0.11164874 -0.14267194 -0.18777370 -0.33155425 
## [1] "ResidentialExemption"
##  SalePrice          0          1 
##  1.0000000  0.1301559 -0.1301559 
## [1] "Owner_State"
##   SalePrice          CO          NY          --          NH          CA 
##  1.00000000  0.49444808  0.01189068 -0.01872197 -0.07339616 -0.09637219 
##          MA 
## -0.10966525 
## [1] "Exterior_Style"
##    SalePrice     COLONIAL    VICTORIAN CONTEMPORARY      ROW-END 
##  1.000000000  0.435012257  0.299161350  0.108033551  0.055042910 
##        RANCH        OTHER     CAPE-COD     GARRISON          ROW 
## -0.009095571 -0.047497861 -0.058268591 -0.065242994 -0.066032977 
##  TOWNHSE-END    TOWNHOUSE CONVENTIONAL 
## -0.169153091 -0.201823100 -0.315036054 
## [1] "Exterior_occupancy"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Exterior_WallType"
##      SalePrice          Brick  Frame-Clapbrd Concrete Block          Other 
##     1.00000000     0.25236388     0.10972323     0.08458348    -0.02655564 
## Asbstos Shingl         Stucco   Wood Shingle Aluminum-Vinyl 
##    -0.05324821    -0.07003898    -0.10862346    -0.18222837 
## [1] "Exterior_RoofType"
##    SalePrice          Hip      Gambrel      Mansard         Flat 
##  1.000000000  0.426213845  0.268563513  0.064714966 -0.002745705 
##         Shed        Gable 
## -0.149648912 -0.346627889 
## [1] "Exterior_RoofMaterial"
##      SalePrice     Slate Clay          Other   Roll Roofing Rubber Membrai 
##    1.000000000    0.479088666    0.067179639   -0.009095571   -0.013866012 
## Aspahlt Shingl 
##   -0.443698828 
## [1] "Interior_Flooring"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Interior_Layout"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Interior_LaundryInUnit"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Systems_HeatType"
##    SalePrice        Other   Forced Air Elec-Radiant        Steam 
##   1.00000000   0.24973498   0.11665633  -0.05804902  -0.07400324 
##    Hot Water 
##  -0.13587871 
## [1] "Systems_HeatFuel"
##   SalePrice         Gas    Electric         Oil 
##  1.00000000  0.01054795 -0.00240819 -0.01017293 
## [1] "Systems_CentralAir"
##  SalePrice          1          0 
##  1.0000000  0.3140731 -0.3140731 
## [1] "Systems_Plumbing"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Condition_InteriorCondition"
##   SalePrice   Excellent        Fair     Average   Very Good        Good 
##  1.00000000  0.32969302  0.04129457 -0.11722781 -0.12704186 -0.18623783 
## [1] "Condition_OverallCondition"
##   SalePrice    Superior   Excellent        Fair     Average   Very Good 
##  1.00000000  0.45636459  0.14898957  0.02305601 -0.04194242 -0.10566094 
##        Good 
## -0.22385606 
## [1] "Condition_OverallGrade"
##      SalePrice Estate Quality      Excellent      Very Good   Very Good Ex 
##     1.00000000     0.49444808     0.35056863     0.34738030     0.30382588 
## Good Very Good   Average Plus           Poor           Good           Fair 
##     0.21450636    -0.08718006    -0.10215983    -0.13468674    -0.19886541 
##        Average 
##    -0.32919019
for (factorCol in factorCols) {
  if (factorCol != "PID") {
    print(factorCol)
    singleResSaleTrain %>% 
      filter(as.character(Exterior_occupancy) != "SNGL-FAM-RES" | is.na(Exterior_occupancy)) %>% 
      select(PID, SalePrice, factorCol) %>% mutate(one = 1) %>% 
      spread(factorCol, one, fill = 0) %>% select(-PID) %>% 
      cor() %>% .[,"SalePrice"] %>% sort(decreasing = TRUE) %>% 
      print()
  }
}
## [1] "StateClassCode"
##  SalePrice       1021        102 
##  1.0000000  0.1746209 -0.1746209 
## [1] "PropertyClass"
##   SalePrice    CNDO LUX CONDOMINIUM 
##   1.0000000   0.1746209  -0.1746209 
## [1] "Zoning"
##   SalePrice        <NA>           B         O-1           C          BA 
##  1.00000000  0.04263609  0.02390421 -0.01219114 -0.01747849 -0.03286140 
##         C-1 
## -0.05726753 
## [1] "TaxDistrict"
##    SalePrice          R11          R10          R16           R9 
##  1.000000000  0.307633640  0.270088570  0.043850510  0.043542008 
##           R8           R6          R1A           R5          R17 
##  0.030623345  0.026802738  0.024870189  0.015332130  0.011480866 
##          R13          R14           R4           R3           R1 
##  0.006124646  0.002560371 -0.005545813 -0.030978824 -0.039599006 
##          R12           R2           R7 
## -0.047894406 -0.090967556 -0.154776056 
## [1] "ResidentialExemption"
##   SalePrice           1           0 
##  1.00000000  0.03926992 -0.03926992 
## [1] "Owner_State"
##    SalePrice           CO           ME           MI           PA 
##  1.000000000  0.244825125  0.220081600  0.027697645  0.011839845 
##           NY           GA           VA           NH           CT 
##  0.002116040 -0.001131498 -0.012107250 -0.012402937 -0.016497551 
##           TX           NJ           TN           --           MD 
## -0.021016090 -0.022184986 -0.022584104 -0.023557677 -0.024429054 
##            -           FL           CA           MA           VT 
## -0.025577491 -0.027672862 -0.028798843 -0.037158726 -0.041842977 
## [1] "Exterior_Style"
##        SalePrice       TWO-FLOORS           SINGLE        TOWNHOUSE 
##       1.00000000       0.18629841       0.17497417       0.14822622 
##     SEMIDETACHED            OTHER    FREE-STANDING      TOWNHSE-END 
##       0.14396342       0.09285234       0.05892525       0.02296720 
##            ATTIC Commercial Condo             FLAT         BASEMENT 
##      -0.01286401      -0.02797220      -0.09437659      -0.10148348 
##        APARTMENT 
##      -0.18673767 
## [1] "Exterior_occupancy"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Exterior_WallType"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Exterior_RoofType"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Exterior_RoofMaterial"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Interior_Flooring"
##      SalePrice       Hardwood       Softwood Cermc/Qry Tile Vinyl/Linoleum 
##     1.00000000     0.09559675    -0.01605328    -0.02507859    -0.03460975 
##           <NA>         Carpet 
##    -0.05736854    -0.06698034 
## [1] "Interior_Layout"
##    SalePrice  Corner Unit     Superior    Thru Unit         Poor 
##  1.000000000  0.150749197  0.077694357 -0.001147911 -0.009381232 
##         <NA>    No Impact 
## -0.052504954 -0.088404924 
## [1] "Interior_LaundryInUnit"
##  SalePrice          1          0 
##  1.0000000  0.2994261 -0.2994261 
## [1] "Systems_HeatType"
##    SalePrice   Forced Air    Elec Wall  HW Radiator      H.V.A.C 
##  1.000000000  0.284747170  0.004698664  0.004189207 -0.003904207 
##   Space Heat    Heat Pump         <NA>        Steam    Hot Water 
## -0.008814524 -0.074137119 -0.081571156 -0.144316175 -0.150656296 
## [1] "Systems_HeatFuel"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Systems_CentralAir"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Systems_Plumbing"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Condition_InteriorCondition"
## Warning in cor(.): the standard deviation is zero
## SalePrice 
##         1 
## [1] "Condition_OverallCondition"
##    SalePrice    Excellent         Poor         Fair    Very Good 
##  1.000000000  0.215873693 -0.009084755 -0.030600922 -0.036106529 
##      Average         Good 
## -0.128264658 -0.144040027 
## [1] "Condition_OverallGrade"
##   SalePrice   Excellent   Very Good        Good        Fair     Average 
##  1.00000000  0.43626835  0.23764798 -0.06057705 -0.08371644 -0.25037003
singleResSaleTrain %>% 
  filter(as.character(Exterior_occupancy) == "SNGL-FAM-RES") %>% 
  select_if(is.numeric) %>% cor() %>% .[,"SalePrice"] %>% sort(decreasing = TRUE)
##               SalePrice           AssessedValue   PreviousAssessedValue 
##               1.0000000               0.9737212               0.9400675 
##           BuildingValue               LandValue     Interior_LivingArea 
##               0.9126605               0.8771784               0.8448729 
##     Interior_Fireplaces                LandArea     Interior_TotalRooms 
##               0.7305289               0.7202855               0.6453451 
##      Interior_FullBaths       Interior_Bedrooms UnfinishedBasementGross 
##               0.6286787               0.4852795               0.4505377 
##            Parking_Open         Parking_Covered      Interior_HalfBaths 
##               0.4192400               0.4179977               0.4140147 
##     Exterior_NumStories   FinishedBasementGross     Condition_YearBuilt 
##               0.3920473               0.2082721              -0.1245507
singleResSaleTrain %>% 
  filter(as.character(Exterior_occupancy) != "SNGL-FAM-RES" | is.na(Exterior_occupancy)) %>% 
  select_if(is.numeric) %>% select(-c(LandArea, LandValue)) %>% cor() %>% .[,"SalePrice"] %>% 
  sort(decreasing = TRUE)
##               SalePrice           BuildingValue           AssessedValue 
##              1.00000000              0.91508242              0.91508242 
##     Interior_LivingArea   PreviousAssessedValue      Interior_FullBaths 
##              0.70852811              0.64067864              0.53914213 
##     Interior_TotalRooms       Interior_Bedrooms      Interior_HalfBaths 
##              0.47969877              0.47008078              0.35268607 
##   FinishedBasementGross     Interior_Fireplaces     Exterior_NumStories 
##              0.32622401              0.29460666              0.29417724 
## UnfinishedBasementGross            Parking_Open     Condition_YearBuilt 
##              0.20572019              0.17026747             -0.01757395 
##         Parking_Covered 
##             -0.03608285
singleResSaleTrain %>% 
  filter(as.character(Exterior_occupancy) == "SNGL-FAM-RES") %>% 
  select(SalePrice, Interior_Fireplaces) %>%
  mutate(Has_Fireplace = Interior_Fireplaces > 0, 
         TwoPlus_FP = Interior_Fireplaces > 1,
         Multi_FP = Interior_Fireplaces > 2) %>%
  cor() %>% .[,"SalePrice"] %>% 
  sort(decreasing = TRUE)
##           SalePrice Interior_Fireplaces            Multi_FP 
##           1.0000000           0.7305289           0.6076482 
##          TwoPlus_FP       Has_Fireplace 
##           0.5918606           0.4716837

I think it makes sense to try developing two seperate models for sale price before attempting to combine the two.

Modeling Attached Single Family Sale Values

asfTrain <- singleResSaleTrain %>% 
  filter(as.character(Exterior_occupancy) != "SNGL-FAM-RES" | is.na(Exterior_occupancy))

head(asfTrain)
data {
  int<lower=0> n;
  real area[n];
  real sale_price[n];
}
parameters {
  real a;
  real b_area;
  real<lower=0> sig;
}
model {
  a ~ normal(500000, 10000);
  b_area ~ normal(0, 10);
  sig ~ cauchy(0, 10);
  
  for (i in 1:n) {
    sale_price ~ normal(a + b_area * area[i], sig);
  }
}
subset_train <- asfTrain[1:500,]

prop_dat <- list(
  n = nrow(subset_train),
  area = as.numeric(subset_train$Interior_LivingArea),
  rooms = subset_train$Interior_TotalRooms,
  bedrooms = subset_train$Interior_Bedrooms,
  halfbaths = subset_train$Interior_HalfBaths,
  fullbaths = subset_train$Interior_FullBaths,
  sale_price = subset_train$SalePrice
)

att_area_fit <- sampling(att_area, data = prop_dat)
print(att_area_fit)
## Inference for Stan model: 4578c88cf133b09aa230a27c880f48da.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##               mean se_mean      sd        2.5%        25%        50%
## a        638757.78   37.53 1758.35   635309.79   637607.2   638744.6
## b_area        3.43    0.03    1.45        0.62        2.5        3.4
## sig      350556.45   15.70  488.46   349635.09   350222.1   350551.6
## lp__   -3316941.80    0.03    1.18 -3316944.81 -3316942.4 -3316941.5
##                75%       97.5% n_eff Rhat
## a        639909.06   642229.19  2195    1
## b_area        4.39        6.32  2280    1
## sig      350868.90   351558.78   968    1
## lp__   -3316940.91 -3316940.43  1447    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Mar 30 22:51:33 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
plot(att_area_fit)
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

data {
  int<lower=0> n;
  real rooms[n];
  real bedrooms[n];
  real sale_price[n];
}
transformed data {
  real otherrooms[n];
  for (i in 1:n) {
    otherrooms[i] = rooms[i] - bedrooms[i];
  }
}
parameters {
  real a;
  real b_bedrooms;
  real b_otherrooms;
  real<lower=0> sig;
}
model {
  a ~ normal(500000, 10000);
  b_bedrooms ~ normal(0, 10000);
  b_otherrooms ~ normal(0, 10000);
  sig ~ cauchy(0, 10);
  
  for (i in 1:n) {
    sale_price ~ normal(a + b_bedrooms * bedrooms[i] + b_otherrooms * otherrooms[i], sig);
  }
}
subset_train <- asfTrain[1:500,]

prop_dat <- list(
  n = nrow(subset_train),
  area = as.numeric(subset_train$Interior_LivingArea),
  rooms = subset_train$Interior_TotalRooms,
  bedrooms = subset_train$Interior_Bedrooms,
  halfbaths = subset_train$Interior_HalfBaths,
  fullbaths = subset_train$Interior_FullBaths,
  sale_price = subset_train$SalePrice
)

att_rooms_fit <- sampling(att_rooms, data = prop_dat)
print(att_rooms_fit)
## Inference for Stan model: 6d17a87868e93b20c80e2663ebcc758f.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##                     mean se_mean      sd       2.5%         25%
## a              635860.69   49.02 2279.42   631325.3   634382.25
## b_bedrooms       1456.38   19.18  907.50     -335.8      827.23
## b_otherrooms     1284.91   15.45  709.80      -68.3      803.49
## sig            350580.18    8.97  495.90   349605.8   350243.48
## lp__         -3316940.23    0.04    1.39 -3316943.7 -3316940.95
##                      50%         75%       97.5% n_eff Rhat
## a              635856.15   637390.27   640280.45  2162    1
## b_bedrooms       1469.18     2088.33     3197.36  2239    1
## b_otherrooms     1274.23     1768.41     2749.14  2112    1
## sig            350571.55   350915.93   351546.41  3054    1
## lp__         -3316939.95 -3316939.20 -3316938.46  1439    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Mar 30 22:56:52 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
data {
  int<lower=0> n;
  real rooms[n];
  real bedrooms[n];
  real area[n];
  real sale_price[n];
}
transformed data {
  real otherrooms[n];
  real roomsize[n];
  for (i in 1:n) {
    otherrooms[i] = rooms[i] - bedrooms[i];
    roomsize[i] = area[i] / rooms[i];
  }
}
parameters {
  real a;
  real b_bedrooms;
  real b_otherrooms;
  real b_roomsize;
  real<lower=0> sig;
}
model {
  a ~ normal(500000, 10000);
  b_bedrooms ~ normal(0, 10000);
  b_otherrooms ~ normal(0, 10000);
  b_roomsize ~ normal(0, 1000);
  sig ~ cauchy(0, 10);
  
  for (i in 1:n) {
    sale_price ~ normal(a + b_bedrooms * bedrooms[i] + b_otherrooms * otherrooms[i] + b_roomsize * roomsize[i], sig);
  }
}
print(att_rooms_area_fit)
## Inference for Stan model: a993496f44f32b13b1135e5c7fcb7e7b.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##                     mean se_mean      sd        2.5%         25%
## a              631853.17   39.04 2259.77   627530.40   630288.74
## b_bedrooms        642.47   18.02  672.15     -637.67      172.13
## b_otherrooms     1019.88   10.75  554.51      -80.23      640.88
## b_roomsize         11.67    0.08    5.57        0.25        8.03
## sig            352941.37    8.30  352.49   352255.87   352697.22
## lp__         -6504396.44    0.04    1.48 -6504400.07 -6504397.23
##                      50%         75%       97.5% n_eff Rhat
## a              631793.51   633379.38   636307.84  3350    1
## b_bedrooms        655.67     1104.75     1930.03  1392    1
## b_otherrooms     1030.34     1398.22     2075.59  2663    1
## b_roomsize         11.66       15.44       22.39  4375    1
## sig            352935.75   353186.92   353635.20  1802    1
## lp__         -6504396.15 -6504395.36 -6504394.46  1677    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Mar 30 23:11:41 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
data {
  int<lower=0> n;
  real rooms[n];
  real bedrooms[n];
  real area[n];
  real sale_price[n];
  real wd[n];
}
transformed data {
  real otherrooms[n];
  real roomsize[n];
  for (i in 1:n) {
    otherrooms[i] = rooms[i] - bedrooms[i];
    roomsize[i] = area[i] / rooms[i];
  }
}
parameters {
  real a;
  real b_bedrooms;
  real b_otherrooms;
  real b_roomsize;
  real b_wd;
  real<lower=0> sig;
}
model {
  a ~ normal(500000, 10000);
  b_bedrooms ~ normal(0, 10000);
  b_otherrooms ~ normal(0, 10000);
  b_roomsize ~ normal(0, 1000);
  b_wd ~ normal(0, 1000);
  sig ~ cauchy(0, 10);
  
  for (i in 1:n) {
    sale_price ~ normal(a + b_bedrooms * bedrooms[i] + b_otherrooms * otherrooms[i] + b_roomsize * roomsize[i] + b_wd * wd[i], sig);
  }
}
print(att_rooms_area_wd_fit)
## Inference for Stan model: e97e93a1d75b860a2d2a45a831123245.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##                    mean se_mean      sd       2.5%        25%        50%
## a             575382.73  108.21 5309.96  564942.09  571729.45  575349.51
## b_bedrooms      2210.26   31.86 1602.44    -851.09    1106.20    2222.23
## b_otherrooms    4437.91   28.39 1398.58    1632.50    3515.63    4446.85
## b_roomsize        51.06    0.26   14.31      23.42      41.20      51.13
## b_wd              52.50   21.68  957.17   -1920.17    -566.51      62.37
## sig           258249.34   13.79  921.87  256419.18  257626.96  258263.29
## lp__         -518506.57    0.05    1.70 -518510.73 -518507.49 -518506.27
##                     75%      97.5% n_eff Rhat
## a             578957.48  585688.00  2408    1
## b_bedrooms      3260.42    5365.73  2530    1
## b_otherrooms    5353.74    7180.09  2426    1
## b_roomsize        60.80      80.01  2954    1
## b_wd             704.35    1890.35  1950    1
## sig           258870.93  260020.92  4468    1
## lp__         -518505.32 -518504.19  1287    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Mar 30 23:14:16 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Visualizing Property Locations

# what are the physical locations of properties?